Thera Bank recently saw a steep decline in the number of users of its credit cards. Credit cards are a good source of income for banks because of the various fees they carry, such as annual fees, balance transfer fees, cash advance fees, late payment fees, and foreign transaction fees. Some fees are charged to every user irrespective of usage, while others apply only under specified circumstances.
Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze customer data, identify the customers who are likely to leave, and understand the reasons why, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
This is a commented Jupyter (IPython) notebook in which all the instructions and tasks to be performed are mentioned.
# To help with reading and manipulating data
import pandas as pd
import numpy as np
# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
# To be used for missing value imputation
from sklearn.impute import SimpleImputer
from sklearn.impute import KNNImputer
# To help with model building
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
from xgboost import XGBClassifier
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To get different metric scores, and split data
from sklearn import metrics
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
)
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To be used for tuning the model
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation in dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# This will help in making the Python code more structured automatically (good coding practice)
# %load_ext nb_black
from google.colab import drive
drive.mount('/content/drive')
%cd /content/drive/MyDrive/AI-ML
df = pd.read_csv("BankChurners.csv")
df_ccuser=df.copy()
Mounted at /content/drive
/content/drive/MyDrive/AI-ML
df.shape
(10127, 21)
df_ccuser.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
#get the size of dataframe
print ("Rows : " , df_ccuser.shape[0]) #get number of rows/observations
print ("Columns : " , df_ccuser.shape[1]) #get number of columns
print ("#"*40,"\n","Features : \n\n", df_ccuser.columns.tolist()) #get name of columns/features
print ("#"*40,"\nMissing values :\n\n", df_ccuser.isnull().sum().sort_values(ascending=False))
print( "#"*40,"\nPercent of missing :\n\n", round(df_ccuser.isna().sum() / df_ccuser.isna().count() * 100, 2)) # looking at columns with most Missing Values
print ("#"*40,"\nUnique values : \n\n", df_ccuser.nunique()) # count of unique values
Rows :  10127
Columns :  21
########################################
Features :

['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender', 'Dependent_count', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category', 'Months_on_book', 'Total_Relationship_Count', 'Months_Inactive_12_mon', 'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal', 'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt', 'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio']
########################################
Missing values :

Education_Level             1519
Marital_Status               749
CLIENTNUM                      0
Contacts_Count_12_mon          0
Total_Ct_Chng_Q4_Q1            0
Total_Trans_Ct                 0
Total_Trans_Amt                0
Total_Amt_Chng_Q4_Q1           0
Avg_Open_To_Buy                0
Total_Revolving_Bal            0
Credit_Limit                   0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Attrition_Flag                 0
Months_on_book                 0
Card_Category                  0
Income_Category                0
Dependent_count                0
Gender                         0
Customer_Age                   0
Avg_Utilization_Ratio          0
dtype: int64
########################################
Percent of missing :

CLIENTNUM                   0.000
Attrition_Flag              0.000
Customer_Age                0.000
Gender                      0.000
Dependent_count             0.000
Education_Level            15.000
Marital_Status              7.400
Income_Category             0.000
Card_Category               0.000
Months_on_book              0.000
Total_Relationship_Count    0.000
Months_Inactive_12_mon      0.000
Contacts_Count_12_mon       0.000
Credit_Limit                0.000
Total_Revolving_Bal         0.000
Avg_Open_To_Buy             0.000
Total_Amt_Chng_Q4_Q1        0.000
Total_Trans_Amt             0.000
Total_Trans_Ct              0.000
Total_Ct_Chng_Q4_Q1         0.000
Avg_Utilization_Ratio       0.000
dtype: float64
########################################
Unique values :

CLIENTNUM                   10127
Attrition_Flag                  2
Customer_Age                   45
Gender                          2
Dependent_count                 6
Education_Level                 6
Marital_Status                  3
Income_Category                 6
Card_Category                   4
Months_on_book                 44
Total_Relationship_Count        6
Months_Inactive_12_mon          7
Contacts_Count_12_mon           7
Credit_Limit                 6205
Total_Revolving_Bal          1974
Avg_Open_To_Buy              6813
Total_Amt_Chng_Q4_Q1         1158
Total_Trans_Amt              5033
Total_Trans_Ct                126
Total_Ct_Chng_Q4_Q1           830
Avg_Utilization_Ratio         964
dtype: int64
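The four print statements above can be condensed into a single summary frame. A minimal sketch, run on a toy dataframe; the helper name `summarize_missing` is ours, not part of the notebook:

```python
import pandas as pd

def summarize_missing(df):
    """One frame with missing count, missing %, and unique values per column."""
    return pd.DataFrame({
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),  # nunique() ignores NaN
    }).sort_values("missing", ascending=False)

demo = pd.DataFrame({"a": [1, None, 3], "b": ["x", "y", "y"]})
print(summarize_missing(demo))
```

On the real data the same call would surface Education_Level (15%) and Marital_Status (7.4%) at the top in one view.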
df_ccuser.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
df_ccuser.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
# show value counts for each categorical variable
category_col = ['Attrition_Flag', 'Gender','Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category']
for column in category_col:
print(df_ccuser[column].value_counts())
print("#" * 40)
Existing Customer    8500
Attrited Customer    1627
Name: Attrition_Flag, dtype: int64
########################################
F    5358
M    4769
Name: Gender, dtype: int64
########################################
Graduate         3128
High School      2013
Uneducated       1487
College          1013
Post-Graduate     516
Doctorate         451
Name: Education_Level, dtype: int64
########################################
Married     4687
Single      3943
Divorced     748
Name: Marital_Status, dtype: int64
########################################
Less than $40K    3561
$40K - $60K       1790
$80K - $120K      1535
$60K - $80K       1402
abc               1112
$120K +            727
Name: Income_Category, dtype: int64
########################################
Blue        9436
Silver       555
Gold         116
Platinum      20
Name: Card_Category, dtype: int64
########################################
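Note that `value_counts` skips missing values by default, so the NaN rows in Education_Level and Marital_Status do not appear in the counts above; passing `dropna=False` surfaces them. A minimal check on a toy series:

```python
import pandas as pd

# Toy stand-in for a column like Marital_Status with one missing entry
s = pd.Series(["Married", "Single", None, "Married"])
vc = s.value_counts(dropna=False)  # include the NaN bucket in the counts
print(vc)
```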
Observations
Sanity checks
Questions:
- How does the change in transaction count (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
- How does inactivity (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (12,7))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
# For the histogram ('palette' has no effect without 'hue', so it is omitted)
if bins:
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins)
else:
    sns.histplot(data=data, x=feature, kde=kde, ax=ax_hist2)
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 1, 5))
else:
plt.figure(figsize=(n + 1, 5))
plt.xticks(rotation=90, fontsize=15)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n].sort_values(),
)
for p in ax.patches:
if perc:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
# function to plot stacked bar chart
def stacked_barplot(data, predictor, target):
"""
Print the category counts and plot a stacked bar chart
data: dataframe
predictor: independent variable
target: target variable
"""
count = data[predictor].nunique()
sorter = data[target].value_counts().index[-1]
tab1 = pd.crosstab(data[predictor], data[target], margins=True).sort_values(
by=sorter, ascending=False
)
print(tab1)
print("-" * 120)
tab = pd.crosstab(data[predictor], data[target], normalize="index").sort_values(
by=sorter, ascending=False
)
tab.plot(kind="bar", stacked=True, figsize=(count + 1, 5))
plt.legend(loc="upper left", bbox_to_anchor=(1, 1), frameon=False)  # single legend call; the earlier duplicate was overridden
plt.show()
### Function to plot distributions
def distribution_plot_wrt_target(data, predictor, target):
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
target_uniq = data[target].unique()
axs[0, 0].set_title("Distribution of target for target=" + str(target_uniq[0]))
sns.histplot(
data=data[data[target] == target_uniq[0]],
x=predictor,
kde=True,
ax=axs[0, 0],
color="teal",
)
axs[0, 1].set_title("Distribution of target for target=" + str(target_uniq[1]))
sns.histplot(
data=data[data[target] == target_uniq[1]],
x=predictor,
kde=True,
ax=axs[0, 1],
color="orange",
)
axs[1, 0].set_title("Boxplot w.r.t target")
sns.boxplot(data=data, x=target, y=predictor, ax=axs[1, 0], palette="gist_rainbow")
axs[1, 1].set_title("Boxplot (without outliers) w.r.t target")
sns.boxplot(
data=data,
x=target,
y=predictor,
ax=axs[1, 1],
showfliers=False,
palette="gist_rainbow",
)
plt.tight_layout()
plt.show()
# Observations on Customer_age
histogram_boxplot(df_ccuser, "Customer_Age")
# Observations on Dependent_count
histogram_boxplot(df_ccuser, "Dependent_count")
# Observations on Months_on_book
histogram_boxplot(df_ccuser, "Months_on_book")
# Observations on Total_Relationship_Count
histogram_boxplot(df_ccuser, "Total_Relationship_Count")
# Observations on Months_Inactive_12_mon
histogram_boxplot(df_ccuser, "Months_Inactive_12_mon")
# Observations on Contacts_Count_12_mon
histogram_boxplot(df_ccuser, "Contacts_Count_12_mon")
# Observations on Credit_Limit
histogram_boxplot(df_ccuser, "Credit_Limit")
# Observations on Total_Revolving_Bal
histogram_boxplot(df_ccuser, "Total_Revolving_Bal")
# Observations on Avg_Open_To_Buy
histogram_boxplot(df_ccuser, "Avg_Open_To_Buy")
# Observations on Total_Amt_Chng_Q4_Q1
histogram_boxplot(df_ccuser, "Total_Amt_Chng_Q4_Q1")
# Observations on Total_Trans_Amt
histogram_boxplot(df_ccuser, "Total_Trans_Amt")
# Observations on Total_Trans_Ct
histogram_boxplot(df_ccuser, "Total_Trans_Ct")
# Observations on Total_Ct_Chng_Q4_Q1
histogram_boxplot(df_ccuser, "Total_Ct_Chng_Q4_Q1")
# Observations on Avg_Utilization_Ratio
histogram_boxplot(df_ccuser, "Avg_Utilization_Ratio")
# observations on Attrition Flag
labeled_barplot(df_ccuser, "Attrition_Flag")
# observations on Gender
labeled_barplot(df_ccuser, "Gender")
# observations on Education_Level
labeled_barplot(df_ccuser, "Education_Level")
# observations on Marital_Status
labeled_barplot(df_ccuser, "Marital_Status")
# observations on Income_Category
labeled_barplot(df_ccuser, "Income_Category")
# observations on Card_Category
labeled_barplot(df_ccuser, "Card_Category")
sns.pairplot(df_ccuser, hue="Attrition_Flag", corner=True)
<seaborn.axisgrid.PairGrid at 0x79e8e2ef6020>
plt.figure(figsize=(15, 7))
sns.heatmap(df_ccuser.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
Customer_Age and Months_on_book (the length of the customer's relationship with the bank) are strongly correlated, which is expected: older customers have generally been with the bank longer.
The credit limit and average utilization ratio exhibit a negative correlation, suggesting that as the credit limit increases, the average utilization ratio tends to decrease.
There is a positive correlation between the total revolving balance and average utilization, indicating that customers with higher average utilization tend to have higher total revolving balances.
The average opening balance is negatively correlated with the average utilization ratio, implying that as the average opening balance increases, the average utilization ratio tends to decrease.
There is very little correlation between the total transaction amount and the credit limit, suggesting that the credit limit does not significantly affect how much customers transact.
As expected, there is a high correlation between the total transaction amount and the total transaction count: customers with more transactions tend to have higher total transaction amounts.
The credit limit and average open-to-buy are almost perfectly correlated (open-to-buy is essentially the credit limit minus the revolving balance), so one of them can be dropped to avoid redundancy.
It is logical that the total transaction amount is correlated with the total amount change and the total count change, as these features may be derived from the total transaction amount. Consider dropping one of these columns to avoid duplication.
These observations are based on the correlations found in the data and provide insights for further analysis and feature selection.
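The open-to-buy redundancy noted above can be demonstrated directly. A sketch on toy rows (values taken from the head of the table; dropping the column is one option, not a step the notebook performs here):

```python
import pandas as pd

# Toy rows illustrating the identity the near-perfect correlation suggests:
# Avg_Open_To_Buy = Credit_Limit - Total_Revolving_Bal
toy = pd.DataFrame({
    "Credit_Limit": [12691.0, 8256.0, 3418.0],
    "Total_Revolving_Bal": [777.0, 864.0, 0.0],
})
toy["Avg_Open_To_Buy"] = toy["Credit_Limit"] - toy["Total_Revolving_Bal"]
corr = toy["Credit_Limit"].corr(toy["Avg_Open_To_Buy"])
reduced = toy.drop(columns=["Avg_Open_To_Buy"])  # drop the redundant column
print(round(corr, 3))  # very close to 1: balances are small relative to limits
```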
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Customer_Age", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Customer_Age'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Dependent_count", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Dependent_count'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Relationship_Count", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Total_Relationship_Count'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Months_Inactive_12_mon", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Months_Inactive_12_mon'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Contacts_Count_12_mon", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Contacts_Count_12_mon'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Credit_Limit", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Credit_Limit'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Revolving_Bal", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Total_Revolving_Bal'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Amt", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Total_Trans_Amt'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Total_Trans_Ct", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Total_Trans_Ct'>
sns.set(rc={"figure.figsize": (10, 7)})
sns.boxplot(x="Attrition_Flag", y="Avg_Utilization_Ratio", data=df_ccuser, orient="vertical")
<Axes: xlabel='Attrition_Flag', ylabel='Avg_Utilization_Ratio'>
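The boxplot cells above differ only in the y-axis column, so they can be collapsed into one loop. A sketch on a toy frame (`df_demo` and its values are ours; in the notebook the loop would iterate over df_ccuser's numeric columns, and a non-interactive backend is set only so the sketch runs headless):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for df_ccuser with a target column and two numeric columns
df_demo = pd.DataFrame({
    "Attrition_Flag": ["Existing Customer", "Attrited Customer"] * 10,
    "Customer_Age": range(20),
    "Total_Trans_Ct": range(20, 40),
})
num_cols = ["Customer_Age", "Total_Trans_Ct"]  # extend with the other numeric columns

figs = []
for col in num_cols:
    fig, ax = plt.subplots(figsize=(10, 7))
    sns.boxplot(data=df_demo, x="Attrition_Flag", y=col, ax=ax)
    ax.set_title(f"{col} by Attrition_Flag")
    figs.append(fig)
plt.close("all")  # close figures when batch-generating rather than showing
```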
## Converting the data type of categorical features to 'category'
cat_cols = ['Attrition_Flag','Gender', 'Education_Level', 'Marital_Status', 'Income_Category', 'Card_Category','Dependent_count','Total_Relationship_Count','Months_Inactive_12_mon','Contacts_Count_12_mon']
df_ccuser[cat_cols] = df_ccuser[cat_cols].astype('category')
df_ccuser.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  category
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  category
 4   Dependent_count           10127 non-null  category
 5   Education_Level           8608 non-null   category
 6   Marital_Status            9378 non-null   category
 7   Income_Category           10127 non-null  category
 8   Card_Category             10127 non-null  category
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  category
 11  Months_Inactive_12_mon    10127 non-null  category
 12  Contacts_Count_12_mon     10127 non-null  category
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: category(10), float64(5), int64(6)
memory usage: 971.4 KB
df_ccuser.describe(include=['category']).T
| | count | unique | top | freq |
|---|---|---|---|---|
| Attrition_Flag | 10127 | 2 | Existing Customer | 8500 |
| Gender | 10127 | 2 | F | 5358 |
| Dependent_count | 10127 | 6 | 3 | 2732 |
| Education_Level | 8608 | 6 | Graduate | 3128 |
| Marital_Status | 9378 | 3 | Married | 4687 |
| Income_Category | 10127 | 6 | Less than $40K | 3561 |
| Card_Category | 10127 | 4 | Blue | 9436 |
| Total_Relationship_Count | 10127 | 6 | 3 | 2305 |
| Months_Inactive_12_mon | 10127 | 7 | 3 | 3846 |
| Contacts_Count_12_mon | 10127 | 7 | 3 | 3380 |
df_ccuser['Agebin'] = pd.cut(df_ccuser['Customer_Age'], bins = [25, 35,45,55,65, 75], labels = ['25-35', '36-45', '46-55', '56-65','66-75'])
df_ccuser.Agebin.value_counts()
46-55    4135
36-45    3742
56-65    1321
25-35     919
66-75      10
Name: Agebin, dtype: int64
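One caveat with `pd.cut`: intervals are right-closed by default, so the bin labeled "25-35" is really (25, 35]. With a minimum age of 26 in this data nothing is dropped, but an exact age of 25 would fall outside every bin. A small check:

```python
import pandas as pd

# pd.cut uses right-closed intervals (25, 35], (35, 45], ... by default
ages = pd.Series([25, 26, 35, 36, 73])
binned = pd.cut(ages, bins=[25, 35, 45, 55, 65, 75],
                labels=["25-35", "36-45", "46-55", "56-65", "66-75"])
print(binned.tolist())  # the first entry (age 25) comes back as NaN
```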
# Plot count distributions for all the categorical variables
plt.figure(figsize=(14, 17))
sns.set_theme(style="white")
for i, variable in enumerate(cat_cols):
    plt.subplot(9, 2, i + 1)
    order = df_ccuser[variable].value_counts(ascending=False).index
    sns.set_palette('twilight_shifted')
    ax = sns.countplot(x=df_ccuser[variable], data=df_ccuser, order=order)  # pass the computed order
    sns.despine(top=True, right=True, left=True)
    for p in ax.patches:
        percentage = '{:.1f}%'.format(100 * p.get_height() / len(df_ccuser[variable]))
        x = p.get_x() + p.get_width() / 2 - 0.05
        y = p.get_y() + p.get_height()
        plt.annotate(percentage, (x, y), ha='center')
    plt.title(cat_cols[i].upper())  # title each subplot inside the loop
plt.tight_layout()  # adjust spacing once, after all subplots are drawn
Approximately 16% of credit card customers have attrited, indicating a significant portion of customers who have discontinued their credit card services.
Around 52% of the credit card customers are female, highlighting the majority gender demographic among credit card holders.
Around 30% of the customers are graduates, while the number of post-graduates and doctorate holders is relatively low, indicating a lower representation of higher educational degrees among the customer base.
Approximately 46% of the credit card customers are married. However, there is a 7.4% unknown marital status which requires imputation or further investigation.
Around 35% of the customers earn less than 40k, indicating a significant portion of customers with lower income levels.
Approximately 93% of the customers hold a blue card, suggesting that the majority of customers have a standard card type. Conversely, there is a low percentage of customers with platinum cards, indicating a smaller group with premium card benefits.
Around 22% of the customers hold exactly three bank products (the most common relationship count), indicating a sizeable portion of customers who utilize multiple banking services.
Approximately 38% of the customers have been inactive for three months, and it would be worthwhile to investigate customers who have been inactive for four, five, or six months to determine any potential relationship with attrition.
Around 60% of the customers were contacted 2-3 times within a 12-month period, indicating a common frequency of communication between the credit card company and the customers.
plt.figure(figsize=(10,5))
sns.set_palette(sns.color_palette("tab20", 8))
sns.barplot(y='Credit_Limit',x='Income_Category',hue='Attrition_Flag',data=df_ccuser)
sns.despine(top=True,right=True,left=True) # to remove side line from graph
plt.legend(bbox_to_anchor=(1.00, 1))
plt.title('Income vs credit')
Text(0.5, 1.0, 'Income vs credit')
cat_cols.append("Agebin")
for variable in cat_cols:
stacked_barplot(df_ccuser, variable, "Attrition_Flag")
Attrition_Flag     Attrited Customer  Existing Customer    All
Attrition_Flag
Attrited Customer               1627                  0   1627
All                             1627               8500  10127
Existing Customer                  0               8500   8500
------------------------------------------------------------------------
Attrition_Flag  Attrited Customer  Existing Customer    All
Gender
All                          1627               8500  10127
F                             930               4428   5358
M                             697               4072   4769
------------------------------------------------------------------------
Attrition_Flag   Attrited Customer  Existing Customer   All
Education_Level
All                           1371               7237  8608
Graduate                       487               2641  3128
High School                    306               1707  2013
Uneducated                     237               1250  1487
College                        154                859  1013
Doctorate                       95                356   451
Post-Graduate                   92                424   516
------------------------------------------------------------------------
Attrition_Flag  Attrited Customer  Existing Customer   All
Marital_Status
All                          1498               7880  9378
Married                       709               3978  4687
Single                        668               3275  3943
Divorced                      121                627   748
------------------------------------------------------------------------
Attrition_Flag   Attrited Customer  Existing Customer    All
Income_Category
All                           1627               8500  10127
Less than $40K                 612               2949   3561
$40K - $60K                    271               1519   1790
$80K - $120K                   242               1293   1535
$60K - $80K                    189               1213   1402
abc                            187                925   1112
$120K +                        126                601    727
------------------------------------------------------------------------
Attrition_Flag  Attrited Customer  Existing Customer    All
Card_Category
All                          1627               8500  10127
Blue                         1519               7917   9436
Silver                         82                473    555
Gold                           21                 95    116
Platinum                        5                 15     20
------------------------------------------------------------------------
Attrition_Flag   Attrited Customer  Existing Customer    All
Dependent_count
All                           1627               8500  10127
3                              482               2250   2732
2                              417               2238   2655
1                              269               1569   1838
4                              260               1314   1574
0                              135                769    904
5                               64                360    424
------------------------------------------------------------------------
Attrition_Flag            Attrited Customer  Existing Customer    All
Total_Relationship_Count
All                                    1627               8500  10127
3                                       400               1905   2305
2                                       346                897   1243
1                                       233                677    910
5                                       227               1664   1891
4                                       225               1687   1912
6                                       196               1670   1866
------------------------------------------------------------------------
Attrition_Flag          Attrited Customer  Existing Customer    All
Months_Inactive_12_mon
All                                  1627               8500  10127
3                                     826               3020   3846
2                                     505               2777   3282
4                                     130                305    435
1                                     100               2133   2233
5                                      32                146    178
6                                      19                105    124
0                                      15                 14     29
------------------------------------------------------------------------
Attrition_Flag         Attrited Customer  Existing Customer    All
Contacts_Count_12_mon
All                                 1627               8500  10127
3                                    681               2699   3380
2                                    403               2824   3227
4                                    315               1077   1392
1                                    108               1391   1499
5                                     59                117    176
6                                     54                  0     54
0                                      7                392    399
------------------------------------------------------------------------
Attrition_Flag  Attrited Customer  Existing Customer    All
Agebin
All                          1627               8500  10127
46-55                         688               3447   4135
36-45                         606               3136   3742
56-65                         209               1112   1321
25-35                         122                797    919
66-75                           2                  8     10
------------------------------------------------------------------------
df_ccuser.drop(['CLIENTNUM'],axis=1,inplace=True)
df_ccuser.drop(['Agebin'],axis=1,inplace=True)
df_ccuser.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  category
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  category
 10  Months_Inactive_12_mon    10127 non-null  category
 11  Contacts_Count_12_mon     10127 non-null  category
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: category(10), float64(5), int64(5)
memory usage: 892.3 KB
df_ccuser = df_ccuser.replace({'Unknown': None})
df_ccuser.isnull().sum()
Attrition_Flag                 0
Customer_Age                   0
Gender                         0
Dependent_count                0
Education_Level             1519
Marital_Status               749
Income_Category                0
Card_Category                  0
Months_on_book                 0
Total_Relationship_Count       0
Months_Inactive_12_mon         0
Contacts_Count_12_mon          0
Credit_Limit                   0
Total_Revolving_Bal            0
Avg_Open_To_Buy                0
Total_Amt_Chng_Q4_Q1           0
Total_Trans_Amt                0
Total_Trans_Ct                 0
Total_Ct_Chng_Q4_Q1            0
Avg_Utilization_Ratio          0
dtype: int64
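The value counts earlier also showed an 'abc' level in Income_Category (1,112 rows), which looks like another placeholder; in this notebook it effectively becomes NaN later when the income mapping is applied, but it can be made explicit in the same replace pass. A sketch on a toy series, not the notebook's exact step:

```python
import numpy as np
import pandas as pd

# Toy stand-in for Income_Category with both placeholder spellings
s = pd.Series(["$60K - $80K", "abc", "Unknown", "Less than $40K"])
cleaned = s.replace({"Unknown": np.nan, "abc": np.nan})  # both placeholders -> missing
print(cleaned.isna().sum())  # → 2
```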
df_ccuser.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 20 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   Attrition_Flag            10127 non-null  category
 1   Customer_Age              10127 non-null  int64
 2   Gender                    10127 non-null  category
 3   Dependent_count           10127 non-null  category
 4   Education_Level           8608 non-null   category
 5   Marital_Status            9378 non-null   category
 6   Income_Category           10127 non-null  category
 7   Card_Category             10127 non-null  category
 8   Months_on_book            10127 non-null  int64
 9   Total_Relationship_Count  10127 non-null  category
 10  Months_Inactive_12_mon    10127 non-null  category
 11  Contacts_Count_12_mon     10127 non-null  category
 12  Credit_Limit              10127 non-null  float64
 13  Total_Revolving_Bal       10127 non-null  int64
 14  Avg_Open_To_Buy           10127 non-null  float64
 15  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 16  Total_Trans_Amt           10127 non-null  int64
 17  Total_Trans_Ct            10127 non-null  int64
 18  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 19  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: category(10), float64(5), int64(5)
memory usage: 892.3 KB
df_ccuser.head()
| | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
# Label Encode categorical variables
attrition = {'Existing Customer':0, 'Attrited Customer':1}
df_ccuser['Attrition_Flag']=df_ccuser['Attrition_Flag'].map(attrition)
marital_status = {'Married':1,'Single':2, 'Divorced':3}
df_ccuser['Marital_Status']=df_ccuser['Marital_Status'].map(marital_status)
education = {'Uneducated':1,'High School':2, 'Graduate':3, 'College':4, 'Post-Graduate':5, 'Doctorate':6}
df_ccuser['Education_Level']=df_ccuser['Education_Level'].map(education)
income = {'Less than $40K':1,'$40K - $60K':2, '$60K - $80K':3, '$80K - $120K':4, '$120K +':5}
df_ccuser['Income_Category']=df_ccuser['Income_Category'].map(income)
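One detail worth keeping in mind with this encoding: `Series.map` returns NaN for any value that is not a key in the mapping dict, so any level outside the dictionaries above becomes a missing value that the mode-imputation step later fills in. A minimal sketch on toy data (not the bank dataset):

```python
import pandas as pd

# 'Unknown' (a hypothetical level) is absent from the mapping dict,
# so Series.map turns it into NaN rather than raising an error.
s = pd.Series(["Married", "Single", "Unknown"])
mapped = s.map({"Married": 1, "Single": 2, "Divorced": 3})
```

This is why the imputation step below must run after the encoding, not before.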
df = df_ccuser.copy()
X = df.drop(["Attrition_Flag"], axis=1)
y = df["Attrition_Flag"]
# The target (Attrition_Flag) is fully populated (10127 non-null), so it needs
# no imputation; mean-imputing a binary label would be meaningless in any case.
# Let's impute the missing values
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
cols_to_impute = ['Income_Category','Education_Level','Marital_Status']
# fit and transform the imputer on train data
X[cols_to_impute] = imp_mode.fit_transform(X[cols_to_impute])
# Splitting data into training, validation and test sets:
# first we split data into 2 parts, say temporary and test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=1, stratify=y
)
# then we split the temporary set into train and validation
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=1, stratify=y_temp
)
print(X_train.shape, X_val.shape, X_test.shape)
(6075, 19) (2026, 19) (2026, 19)
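The two-stage split yields the intended 60/20/20 proportions: the second call takes 25% of the remaining 80%, which is 20% of the full data. A quick arithmetic check:

```python
# 20% is held out for test first; then 25% of the remaining 80% becomes validation
test_frac = 0.2
val_frac = (1 - test_frac) * 0.25    # 0.8 * 0.25 = 0.20 of all rows
train_frac = (1 - test_frac) * 0.75  # 0.8 * 0.75 = 0.60 of all rows
```

This matches the printed shapes: 6075, 2026, and 2026 rows out of 10127.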
# Creating dummy variables for categorical variables
X_train = pd.get_dummies(data=X_train, drop_first=True)
X_val = pd.get_dummies(data=X_val, drop_first=True)
X_test = pd.get_dummies(data=X_test, drop_first=True)
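A caveat with calling `pd.get_dummies` separately on each split: if a categorical level happens to be absent from the validation or test split, that split ends up with fewer dummy columns than the training data. A common safeguard is to reindex the later splits against the training columns; a minimal sketch with hypothetical toy frames:

```python
import pandas as pd

# Toy frames: the validation split is missing the 'Gold' level
# that appears in the training split.
train = pd.DataFrame({"Card_Category": ["Blue", "Gold", "Blue"]})
val = pd.DataFrame({"Card_Category": ["Blue", "Blue"]})

train_d = pd.get_dummies(train, drop_first=True)  # has column Card_Category_Gold
val_d = pd.get_dummies(val, drop_first=True)      # missing that column

# Align the validation columns to the training columns, filling absent levels with 0
val_d = val_d.reindex(columns=train_d.columns, fill_value=0)
```

With a stratified split on a large dataset all levels are usually present in every split, but the reindex makes the pipeline robust either way.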
The nature of predictions made by the classification model will translate as follows:

- Predicting that a customer will attrite when they would actually have stayed (false positive): the bank spends retention effort unnecessarily.
- Predicting that a customer will stay when they actually attrite (false negative): the bank loses the customer and the associated fee income.

Which metric to optimize?

Losing a customer (a false negative) is costlier to the bank than a wasted retention offer, so we want to maximize Recall on the attrited class.
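As a quick sanity check on that choice, here is a small pure-Python comparison of two hypothetical models (the counts are illustrative, not from this dataset): the recall-leaning model misses far fewer attriting customers even though its precision is lower.

```python
# Hypothetical confusion-matrix counts on 1,000 customers, 160 of whom attrite.
counts = {
    "A": {"tp": 120, "fp": 10, "fn": 40},  # precision-leaning model
    "B": {"tp": 150, "fp": 60, "fn": 10},  # recall-leaning model
}

# recall = TP / (TP + FN); precision = TP / (TP + FP)
scores = {
    name: {
        "recall": c["tp"] / (c["tp"] + c["fn"]),
        "precision": c["tp"] / (c["tp"] + c["fp"]),
    }
    for name, c in counts.items()
}
```

Model B misses only 10 attriting customers versus 40 for model A, which is the trade-off we prefer here.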
Let's define a function to output different metrics (including recall) on the train and test set and a function to show confusion matrix so that we do not have to use the same code repetitively while evaluating models.
import pandas as pd

model_comparison_table = pd.DataFrame(columns=['Model', 'Accuracy', 'Recall', 'Precision', 'F1'])

def update_model_comparison(model_name, accuracy, recall, precision, f1, df=None):
    # Create a new DataFrame if one is not provided
    if df is None:
        df = pd.DataFrame(columns=['Model', 'Accuracy', 'Recall', 'Precision', 'F1'])
    # Append the new model's performance metrics (DataFrame.append was removed
    # in pandas 2.0, so we concatenate a one-row frame instead)
    new_row = pd.DataFrame([{'Model': model_name, 'Accuracy': accuracy, 'Recall': recall, 'Precision': precision, 'F1': f1}])
    df = pd.concat([df, new_row], ignore_index=True)
    return df
# defining a function to compute different metrics to check performance of a classification model built using sklearn
def model_performance_classification_sklearn(name, model, predictors, target):
    """
    Function to compute different metrics to check classification model performance

    name: label to print alongside the metrics
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    # predicting using the independent variables
    pred = model.predict(predictors)
    acc = accuracy_score(target, pred)  # to compute Accuracy
    recall = recall_score(target, pred)  # to compute Recall
    precision = precision_score(target, pred)  # to compute Precision
    f1 = f1_score(target, pred)  # to compute F1-score
    # update_model_comparison(name, acc, recall, precision, f1)
    # creating a dataframe of metrics
    df_perf = pd.DataFrame(
        {
            "Accuracy": acc,
            "Recall": recall,
            "Precision": precision,
            "F1": f1,
        },
        index=[0],
    )
    print(name)
    print(df_perf)
    return df_perf
def make_confusion_matrix(y_actual, y_predict, title):
    cm = confusion_matrix(y_actual, y_predict)
    # Define class labels
    class_labels = ['Class 0', 'Class 1']
    # Create a heatmap using seaborn
    sns.heatmap(cm, annot=True, cmap='Blues', fmt='d', xticklabels=class_labels, yticklabels=class_labels)
    # Add labels, title, and axis ticks
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.title(title)
    plt.xticks(ticks=[0.5, 1.5], labels=class_labels)
    plt.yticks(ticks=[0.5, 1.5], labels=class_labels)
    # Show the plot
    plt.show()
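For reference, the four cells that heatmap displays can be counted directly; a minimal pure-Python sketch with hypothetical labels:

```python
# Hypothetical true labels and predictions (1 = attrited, 0 = existing)
y_actual  = [0, 0, 1, 1, 1, 0, 1, 0]
y_predict = [0, 1, 1, 0, 1, 0, 1, 0]

# Count each cell of the 2x2 confusion matrix by hand
tn = sum(1 for a, p in zip(y_actual, y_predict) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(y_actual, y_predict) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(y_actual, y_predict) if a == 1 and p == 0)
tp = sum(1 for a, p in zip(y_actual, y_predict) if a == 1 and p == 1)
```

These are exactly the counts `sklearn.metrics.confusion_matrix` returns (rows = true class, columns = predicted class).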
models = [] # Empty list to store all the models
# Appending models into the list
models.append(("Bagging", BaggingClassifier(random_state=1)))
models.append(("Decision Tree", DecisionTreeClassifier(random_state=1)))
models.append(("Random Forest", RandomForestClassifier(random_state=1)))
models.append(("AdaBoost", AdaBoostClassifier(random_state=1)))
models.append(("Gradient Boosting", GradientBoostingClassifier(random_state=1)))
models.append(("XBoost", XGBClassifier(random_state=1)))
print("\n" "Training Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    model_performance_classification_sklearn(name, model, X_train, y_train)
print("\n" "Validation Performance:" "\n")
for name, model in models:
    model.fit(X_train, y_train)
    model_performance_classification_sklearn(name, model, X_val, y_val)
Training Performance:

| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Bagging | 0.997 | 0.985 | 0.997 | 0.991 |
| Decision Tree | 1.000 | 1.000 | 1.000 | 1.000 |
| Random Forest | 1.000 | 1.000 | 1.000 | 1.000 |
| AdaBoost | 0.958 | 0.840 | 0.892 | 0.865 |
| Gradient Boosting | 0.974 | 0.878 | 0.954 | 0.915 |
| XGBoost | 1.000 | 1.000 | 1.000 | 1.000 |

Validation Performance:

| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Bagging | 0.952 | 0.794 | 0.896 | 0.842 |
| Decision Tree | 0.936 | 0.794 | 0.804 | 0.799 |
| Random Forest | 0.949 | 0.748 | 0.921 | 0.826 |
| AdaBoost | 0.956 | 0.822 | 0.893 | 0.856 |
| Gradient Boosting | 0.964 | 0.825 | 0.947 | 0.882 |
| XGBoost | 0.974 | 0.899 | 0.936 | 0.917 |
On the training set, Decision Tree, Random Forest, and XGBoost achieved perfect scores of 1.000 on every metric, a clear sign of overfitting, and Bagging was close behind (accuracy 0.997, recall 0.985, precision 0.997, F1 0.991). AdaBoost scored an accuracy of 0.958, recall of 0.840, precision of 0.892, and F1 of 0.865, while Gradient Boosting reached 0.974, 0.878, 0.954, and 0.915 respectively.
On the validation set, performance remained strong but clearly below training. Bagging reached an accuracy of 0.952, recall of 0.794, precision of 0.896, and F1 of 0.842; Decision Tree 0.936, 0.794, 0.804, and 0.799; Random Forest 0.949, 0.748, 0.921, and 0.826; AdaBoost 0.956, 0.822, 0.893, and 0.856; Gradient Boosting 0.964, 0.825, 0.947, and 0.882. XGBoost generalized best, with the highest validation accuracy (0.974), recall (0.899), and F1 (0.917).
# Synthetic Minority Over Sampling Technique
print(f"Before OverSampling, counts of label attrited customer: {sum(y_train==1)}")
print(f"Before OverSampling, counts of label existing customer: {sum(y_train==0)} \n")
sm = SMOTE(sampling_strategy = 1 ,k_neighbors = 5, random_state=1)
X_train_over, y_train_over = sm.fit_resample(X_train, y_train.ravel())
print(f"After OverSampling, counts of label attrited customer: {sum(y_train_over==1)}")
print(f"After OverSampling, counts of label existing customer: {sum(y_train_over==0)} \n")
print(f'After OverSampling, the shape of train_X: {X_train_over.shape}')
print(f'After OverSampling, the shape of train_y: {y_train_over.shape} \n')
Before OverSampling, counts of label attrited customer: 976
Before OverSampling, counts of label existing customer: 5099

After OverSampling, counts of label attrited customer: 5099
After OverSampling, counts of label existing customer: 5099

After OverSampling, the shape of train_X: (10198, 39)
After OverSampling, the shape of train_y: (10198,)
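SMOTE does not duplicate minority rows; it synthesizes new ones by interpolating between a minority sample and one of its k nearest minority neighbours. A minimal pure-Python illustration of that idea only (not imblearn's implementation; the two points are hypothetical):

```python
import random

random.seed(1)

x_i = [3.0, 10.0]   # a minority-class sample (two hypothetical features)
x_nn = [5.0, 14.0]  # one of its nearest minority-class neighbours

# SMOTE-style synthetic point: x_i + lam * (x_nn - x_i), lam drawn from [0, 1)
lam = random.random()
x_new = [a + lam * (b - a) for a, b in zip(x_i, x_nn)]
```

Each synthetic point lies on the segment between the two real minority points, which is why SMOTE is applied only to the training split, never to validation or test data.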
print("\n" "Training Performance Atfer OverSampling:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    model_performance_classification_sklearn(name, model, X_train, y_train)
print("\n" "Validation Performance After Oversampling:" "\n")
for name, model in models:
    model.fit(X_train_over, y_train_over)
    model_performance_classification_sklearn(name, model, X_val, y_val)
Training Performance After OverSampling:

| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Bagging | 0.998 | 0.993 | 0.995 | 0.994 |
| Decision Tree | 1.000 | 1.000 | 1.000 | 1.000 |
| Random Forest | 1.000 | 1.000 | 1.000 | 1.000 |
| AdaBoost | 0.949 | 0.856 | 0.831 | 0.843 |
| Gradient Boosting | 0.971 | 0.916 | 0.904 | 0.910 |
| XGBoost | 1.000 | 1.000 | 1.000 | 1.000 |

Validation Performance After OverSampling:

| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Bagging | 0.944 | 0.837 | 0.817 | 0.827 |
| Decision Tree | 0.924 | 0.782 | 0.757 | 0.769 |
| Random Forest | 0.952 | 0.825 | 0.868 | 0.846 |
| AdaBoost | 0.950 | 0.874 | 0.824 | 0.848 |
| Gradient Boosting | 0.962 | 0.902 | 0.865 | 0.883 |
| XGBoost | 0.970 | 0.893 | 0.918 | 0.905 |
After applying oversampling techniques to the training data, the models were retrained and evaluated on both the training and validation sets.
In terms of training performance, Decision Tree, Random Forest, and XGBoost again achieved perfect scores of 1.000 on all metrics, and Bagging was close behind (accuracy 0.998, recall 0.993, precision 0.995, F1 0.994), which again points to overfitting rather than genuine skill. AdaBoost achieved an accuracy of 0.949, recall of 0.856, precision of 0.831, and F1 score of 0.843, while Gradient Boosting achieved an accuracy of 0.971, recall of 0.916, precision of 0.904, and F1 score of 0.910.
On the validation set, the models' performance remained strong. Bagging achieved an accuracy of 0.944, recall of 0.837, precision of 0.817, and F1 score of 0.827. Decision Tree had an accuracy of 0.924, recall of 0.782, precision of 0.757, and F1 score of 0.769. Random Forest achieved an accuracy of 0.952, recall of 0.825, precision of 0.868, and F1 score of 0.846. AdaBoost had an accuracy of 0.950, recall of 0.874, precision of 0.824, and F1 score of 0.848. Gradient Boosting achieved an accuracy of 0.962, recall of 0.902, precision of 0.865, and F1 score of 0.883. XGBoost had the highest accuracy of 0.970, recall of 0.893, precision of 0.918, and F1 score of 0.905.
These results indicate that oversampling lifted validation recall for most models (for example, AdaBoost from 0.822 to 0.874 and Gradient Boosting from 0.825 to 0.902) at the cost of some precision, which is the expected effect of balancing the classes.
# Random undersampler for under sampling the data
# (sampling_strategy=1 balances the classes 1:1)
rus = RandomUnderSampler(random_state=1, sampling_strategy=1)
X_train_under, y_train_under = rus.fit_resample(X_train, y_train)
print("Before Under Sampling, counts of label 'Attrited': {}".format(sum(y_train==1)))
print("Before Under Sampling, counts of label 'Existing': {} \n".format(sum(y_train==0)))
print("After Under Sampling, counts of label 'Attrited': {}".format(sum(y_train_under==1)))
print("After Under Sampling, counts of label 'Existing': {} \n".format(sum(y_train_under==0)))
print('After Under Sampling, the shape of train_X: {}'.format(X_train_under.shape))
print('After Under Sampling, the shape of train_y: {} \n'.format(y_train_under.shape))
Before Under Sampling, counts of label 'Attrited': 976
Before Under Sampling, counts of label 'Existing': 5099

After Under Sampling, counts of label 'Attrited': 976
After Under Sampling, counts of label 'Existing': 976

After Under Sampling, the shape of train_X: (1952, 39)
After Under Sampling, the shape of train_y: (1952,)
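Random undersampling is the mirror image of SMOTE: keep every minority row and draw an equally sized random subset of the majority rows. A minimal pure-Python sketch of that idea only (the index ranges are hypothetical stand-ins for the row labels, not the real training indices):

```python
import random

random.seed(1)

majority_idx = list(range(5099))  # stand-ins for the 'Existing' rows
minority_idx = list(range(976))   # stand-ins for the 'Attrited' rows

# Keep all minority rows; sample (without replacement) a matching number
# of majority rows to get a 1:1 balanced training set
kept_majority = random.sample(majority_idx, k=len(minority_idx))
balanced_size = len(kept_majority) + len(minority_idx)
```

The resulting 1,952-row set matches the shape printed above; the trade-off is that most majority-class information is discarded.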
print("\n" "Training Performance Atfer UnderSampling:" "\n")
for name, model in models:
    model.fit(X_train_under, y_train_under)
    model_performance_classification_sklearn(name, model, X_train, y_train)
print("\n" "Validation Performance After Undersampling:" "\n")
for name, model in models:
    model.fit(X_train_under, y_train_under)
    model_performance_classification_sklearn(name, model, X_val, y_val)
Training Performance After UnderSampling:

| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Bagging | 0.942 | 0.990 | 0.739 | 0.846 |
| Decision Tree | 0.916 | 1.000 | 0.656 | 0.793 |
| Random Forest | 0.942 | 1.000 | 0.735 | 0.847 |
| AdaBoost | 0.929 | 0.950 | 0.708 | 0.811 |
| Gradient Boosting | 0.939 | 0.976 | 0.734 | 0.838 |
| XGBoost | 0.958 | 1.000 | 0.791 | 0.883 |

Validation Performance After UnderSampling:

| Model | Accuracy | Recall | Precision | F1 |
|---|---|---|---|---|
| Bagging | 0.919 | 0.908 | 0.687 | 0.782 |
| Decision Tree | 0.882 | 0.887 | 0.587 | 0.707 |
| Random Forest | 0.929 | 0.933 | 0.714 | 0.809 |
| AdaBoost | 0.923 | 0.948 | 0.691 | 0.799 |
| Gradient Boosting | 0.933 | 0.948 | 0.724 | 0.821 |
| XGBoost | 0.938 | 0.957 | 0.736 | 0.832 |
After applying undersampling techniques to the training data, the models were retrained and evaluated on both the training and validation sets.
In terms of training performance, Bagging achieved an accuracy of 0.942, recall of 0.990, precision of 0.739, and F1 score of 0.846. Decision Tree achieved an accuracy of 0.916, recall of 1.000, precision of 0.656, and F1 score of 0.793. Random Forest achieved an accuracy of 0.942, recall of 1.000, precision of 0.735, and F1 score of 0.847. AdaBoost achieved an accuracy of 0.929, recall of 0.950, precision of 0.708, and F1 score of 0.811. Gradient Boosting achieved an accuracy of 0.939, recall of 0.976, precision of 0.734, and F1 score of 0.838. XGBoost achieved an accuracy of 0.958, recall of 1.000, precision of 0.791, and F1 score of 0.883.
On the validation set, recall rose sharply while precision fell. Bagging achieved an accuracy of 0.919, recall of 0.908, precision of 0.687, and F1 score of 0.782. Decision Tree had an accuracy of 0.882, recall of 0.887, precision of 0.587, and F1 score of 0.707. Random Forest achieved an accuracy of 0.929, recall of 0.933, precision of 0.714, and F1 score of 0.809. AdaBoost had an accuracy of 0.923, recall of 0.948, precision of 0.691, and F1 score of 0.799. Gradient Boosting achieved an accuracy of 0.933, recall of 0.948, precision of 0.724, and F1 score of 0.821. XGBoost had an accuracy of 0.938, recall of 0.957, precision of 0.736, and F1 score of 0.832.
Overall, undersampling shifted the balance of the models: validation recall climbed to roughly 0.89 or higher for every model, but precision and F1 dropped noticeably compared with the earlier runs. Fewer attriting customers are missed, at the price of more false alarms.
Based on the recall scores of our models on the validation set, we can choose the top three for hyperparameter tuning to potentially improve their performance further:

1. XGBoost
2. Gradient Boosting
3. AdaBoost

Note: candidate parameter grids for each model are collected below; the grids for the three selected models are reused in the tuning cells that follow.
# Gradient Boosting
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# AdaBoost
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Bagging
param_grid = {
'max_samples': [0.8,0.9,1],
'max_features': [0.7,0.8,0.9],
'n_estimators' : [30,50,70],
}
# Random Forest
param_grid = {
"n_estimators": np.arange(50,110,25),
"min_samples_leaf": np.arange(1, 4),
"max_features": list(np.arange(0.3, 0.6, 0.1)) + ['sqrt'],
"max_samples": np.arange(0.4, 0.7, 0.1)
}
# Decision Tree
param_grid = {
'max_depth': np.arange(2,6),
'min_samples_leaf': [1, 4, 7],
'max_leaf_nodes' : [10, 15],
'min_impurity_decrease': [0.0001,0.001]
}
# XGBoost
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
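Rather than scoring every combination in grids like these, `RandomizedSearchCV` evaluates only `n_iter` randomly drawn combinations. A minimal pure-Python sketch of that sampling idea (the grid below is a hypothetical stand-in, and real `RandomizedSearchCV` also runs cross-validation for each draw):

```python
import random

random.seed(1)

# A small stand-in grid in the same shape as the ones above
param_grid = {
    "n_estimators": [50, 75, 100],
    "learning_rate": [0.01, 0.1, 0.05],
    "subsample": [0.7, 0.9],
}

# Draw n_iter random combinations instead of all 3 * 3 * 2 = 18 grid points
n_iter = 10
candidates = [
    {k: random.choice(v) for k, v in param_grid.items()} for _ in range(n_iter)
]
```

This is why randomized search with `n_iter=10` stays cheap even when the full grid is large.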
# Hyperparameter tuning AdaBoost original data
# Define the parameter grid for hyperparameter tuning
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Create the AdaBoost classifier
adaboost = AdaBoostClassifier(random_state=1)
# Perform hyperparameter tuning using RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

clf = RandomizedSearchCV(estimator=adaboost, param_distributions=param_grid, cv=5, random_state=1, n_iter=10)
clf.fit(X_train, y_train)
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(clf.best_params_)
print("\n" "Training Performance AdaBoost Atfer Hypeparameter tuning original data:" "\n")
model_performance_classification_sklearn("", clf, X_train, y_train)
print("\n" "Validation Performance AdaBoost After Hyperparamater tuning original data:" "\n")
model_performance_classification_sklearn("", clf, X_val, y_val)
Best Hyperparameters:
{'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}
Training Performance AdaBoost After Hyperparameter tuning original data:
Accuracy Recall Precision F1
0 0.980 0.916 0.959 0.937
Validation Performance AdaBoost After Hyperparameter tuning original data:
Accuracy Recall Precision F1
0 0.965 0.847 0.932 0.887
# Hyperparameter tuning GradientBoostingClassifier original data
# Define the parameter grid for hyperparameter tuning
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Initialize the GradientBoost model
gb = GradientBoostingClassifier(random_state=1)
# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
estimator=gb,
param_distributions=param_grid,
cv=5,
random_state=1,
n_iter=10
)
# Fit the randomized search on your training data
random_search.fit(X_train, y_train)
# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)
print("\n" "Training Performance GradientBoostingClassifier Atfer Hypeparameter tuning original data:" "\n")
model_performance_classification_sklearn("", best_model, X_train, y_train)
print("\n" "Validation Performance GradientBoostingClassifier After Hyperparamater tuning original data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05, 'init': DecisionTreeClassifier(random_state=1)}
Training Performance GradientBoostingClassifier After Hyperparameter tuning original data:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
Validation Performance GradientBoostingClassifier After Hyperparameter tuning original data:
Accuracy Recall Precision F1
0 0.936 0.794 0.804 0.799
# Hyperparameter tuning XGBoost original data
# Define the parameter grid for hyperparameter tuning
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Initialize the XGBoost model
xgb = XGBClassifier(random_state=1)
# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
estimator=xgb,
param_distributions=param_grid,
cv=5,
random_state=1,
n_iter=10
)
# Fit the randomized search on your training data
random_search.fit(X_train, y_train)
# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)
print("\n" "Training Performance XBoost Atfer Hypeparameter tuning original data:" "\n")
model_performance_classification_sklearn("", best_model, X_train, y_train)
print("\n" "Validation Performance XBoost After Hyperparamater tuning original data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'scale_pos_weight': 1, 'n_estimators': 100, 'learning_rate': 0.05, 'gamma': 1}
Training Performance XGBoost After Hyperparameter tuning original data:
Accuracy Recall Precision F1
0 0.983 0.928 0.965 0.946
Validation Performance XGBoost After Hyperparameter tuning original data:
Accuracy Recall Precision F1
0 0.967 0.862 0.930 0.895
# Hyperparameter tuning AdaBoost oversampled data
# Define the parameter grid for hyperparameter tuning
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Create the AdaBoost classifier
adaboost = AdaBoostClassifier(random_state=1)
# Perform hyperparameter tuning using RandomizedSearchCV
clf = RandomizedSearchCV(estimator=adaboost, param_distributions=param_grid, cv=5, random_state=1, n_iter=10)
clf.fit(X_train_over, y_train_over)
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(clf.best_params_)
print("\n" "Training Performance AdaBoost Atfer Hypeparameter tuning oversampled data:" "\n")
model_performance_classification_sklearn("", clf, X_train_over, y_train_over)
print("\n" "Validation Performance AdaBoost After Hyperparamater tuning oversampled data:" "\n")
model_performance_classification_sklearn("", clf, X_val, y_val)
Best Hyperparameters:
{'n_estimators': 50, 'learning_rate': 0.05, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}
Training Performance AdaBoost After Hyperparameter tuning oversampled data:
Accuracy Recall Precision F1
0 0.963 0.972 0.954 0.963
Validation Performance AdaBoost After Hyperparameter tuning oversampled data:
Accuracy Recall Precision F1
0 0.938 0.868 0.775 0.819
# Hyperparameter tuning GradientBoostingClassifier oversampled data
# Define the parameter grid for hyperparameter tuning
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Initialize the GradientBoost model
gb = GradientBoostingClassifier(random_state=1)
# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
estimator=gb,
param_distributions=param_grid,
cv=5,
random_state=1,
n_iter=10
)
# Fit the randomized search on your training data
random_search.fit(X_train_over, y_train_over)
# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)
print("\n" "Training Performance GradientBoostingClassifier Atfer Hypeparameter tuning oversampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_train_over, y_train_over)
print("\n" "Validation Performance GradientBoostingClassifier After Hyperparamater tuning oversampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05, 'init': DecisionTreeClassifier(random_state=1)}
Training Performance GradientBoostingClassifier After Hyperparameter tuning oversampled data:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
Validation Performance GradientBoostingClassifier After Hyperparameter tuning oversampled data:
Accuracy Recall Precision F1
0 0.924 0.782 0.757 0.769
# Hyperparameter tuning XGBoost oversampled data
# Define the parameter grid for hyperparameter tuning
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Initialize the XGBoost model
xgb = XGBClassifier(random_state=1)
# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
estimator=xgb,
param_distributions=param_grid,
cv=5,
random_state=1,
n_iter=10
)
# Fit the randomized search on your training data
random_search.fit(X_train_over, y_train_over)
# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)
print("\n" "Training Performance XBoost Atfer Hypeparameter tuning oversampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_train_over, y_train_over)
print("\n" "Validation Performance XBoost After Hyperparamater tuning oversampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'scale_pos_weight': 1, 'n_estimators': 50, 'learning_rate': 0.05, 'gamma': 3}
Training Performance XGBoost After Hyperparameter tuning oversampled data:
Accuracy Recall Precision F1
0 0.979 0.982 0.976 0.979
Validation Performance XGBoost After Hyperparameter tuning oversampled data:
Accuracy Recall Precision F1
0 0.950 0.902 0.810 0.853
# Hyperparameter tuning AdaBoost undersampled data
# Define the parameter grid for hyperparameter tuning
param_grid = {
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"base_estimator": [
DecisionTreeClassifier(max_depth=2, random_state=1),
DecisionTreeClassifier(max_depth=3, random_state=1),
],
}
# Create the AdaBoost classifier
adaboost = AdaBoostClassifier(random_state=1)
# Perform hyperparameter tuning using RandomizedSearchCV
clf = RandomizedSearchCV(estimator=adaboost, param_distributions=param_grid, cv=5, random_state=1, n_iter=10)
clf.fit(X_train_under, y_train_under)
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(clf.best_params_)
print("\n" "Training Performance AdaBoost Atfer Hypeparameter tuning undersampled data:" "\n")
model_performance_classification_sklearn("", clf, X_train_under, y_train_under)
print("\n" "Validation Performance AdaBoost After Hyperparamater tuning undersampled data:" "\n")
model_performance_classification_sklearn("", clf, X_val, y_val)
Best Hyperparameters:
{'n_estimators': 100, 'learning_rate': 0.1, 'base_estimator': DecisionTreeClassifier(max_depth=3, random_state=1)}
Training Performance AdaBoost After Hyperparameter tuning undersampled data:
Accuracy Recall Precision F1
0 0.988 0.995 0.981 0.988
Validation Performance AdaBoost After Hyperparameter tuning undersampled data:
Accuracy Recall Precision F1
0 0.936 0.957 0.731 0.829
# Hyperparameter tuning GradientBoostingClassifier undersampled data
# Define the parameter grid for hyperparameter tuning
param_grid = {
"init": [AdaBoostClassifier(random_state=1),DecisionTreeClassifier(random_state=1)],
"n_estimators": np.arange(50,110,25),
"learning_rate": [0.01,0.1,0.05],
"subsample":[0.7,0.9],
"max_features":[0.5,0.7,1],
}
# Initialize the GradientBoost model
gb = GradientBoostingClassifier(random_state=1)
# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
estimator=gb,
param_distributions=param_grid,
cv=5,
random_state=1,
n_iter=10
)
# Fit the randomized search on your training data
random_search.fit(X_train_under, y_train_under)
# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)
print("\n" "Training Performance GradientBoostingClassifier Atfer Hypeparameter tuning undersampled dataa:" "\n")
model_performance_classification_sklearn("", best_model, X_train_under, y_train_under)
print("\n" "Validation Performance GradientBoostingClassifier After Hyperparamater tuning undersampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'n_estimators': 100, 'max_features': 0.7, 'learning_rate': 0.05, 'init': DecisionTreeClassifier(random_state=1)}
Training Performance GradientBoostingClassifier After Hyperparameter tuning undersampled data:
Accuracy Recall Precision F1
0 1.000 1.000 1.000 1.000
Validation Performance GradientBoostingClassifier After Hyperparameter tuning undersampled data:
Accuracy Recall Precision F1
0 0.882 0.887 0.587 0.707
# Hyperparameter tuning XGBoost undersampled data
# Define the parameter grid for hyperparameter tuning
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Initialize the XGBoost model
xgb = XGBClassifier(random_state=1)
# Perform randomized search with 10 iterations
random_search = RandomizedSearchCV(
estimator=xgb,
param_distributions=param_grid,
cv=5,
random_state=1,
n_iter=10
)
# Fit the randomized search on your training data
random_search.fit(X_train_under, y_train_under)
# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)
print("\n" "Training Performance XBoost Atfer Hypeparameter tuning undersampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_train_under, y_train_under)
print("\n" "Validation Performance XBoost After Hyperparamater tuning undersampled data:" "\n")
model_performance_classification_sklearn("", best_model, X_val, y_val)
Best Hyperparameters:
{'subsample': 0.7, 'scale_pos_weight': 1, 'n_estimators': 100, 'learning_rate': 0.05, 'gamma': 1}
Training Performance XGBoost After Hyperparameter tuning undersampled data:
Accuracy Recall Precision F1
0 0.986 0.994 0.979 0.986
Validation Performance XGBoost After Hyperparameter tuning undersampled data:
Accuracy Recall Precision F1
0 0.935 0.960 0.725 0.826
comparison_frame = pd.DataFrame({'Model':['Bagging original',
'Decision Tree original',
'Random Forest original',
'AdaBoost original',
'Gradient Boosting original',
'XGBoost original',
'Bagging OverSampling',
'Decision Tree OverSampling',
'Random Forest OverSampling',
'AdaBoost OverSampling',
'Gradient Boosting OverSampling',
'XGBoost OverSampling',
'Bagging UnderSampling',
'Decision Tree UnderSampling',
'Random Forest UnderSampling',
'AdaBoost UnderSampling',
'Gradient Boosting UnderSampling',
'XBoost UnderSampling',
'AdaBoost hp tunning original data',
'GradientBoost hp tunning original data',
'XBoost hp tunning original data',
'AdaBoost hp tunning oversampled data',
'GradientBoost hp tunning oversampled data',
'XBoost hp tunning oversampled data',
'AdaBoost hp tunning undersampled data',
'GradientBoost hp tunning undersampled data',
'XBoost hp tunning undersampled data'],
'train recall':[0.98,1.0,1.0,0.84,0.87,1.0,0.99,1.0,1.0,0.85,0.91,1.0,0.99,1.0,1.0,0.95,0.97,1.0, 0.92,1.0,0.93,0.97,1.0,0.98,0.99,1.0,0.99],
'validation recall':[0.79,0.79,0.74,0.82,0.82,0.89,0.83,0.78,0.82,0.87,0.90,0.89,0.90,0.88,0.93,0.94,0.94,0.95, 0.85,0.79,0.86,0.87,0.78,0.90,0.95,0.87,0.96]})
comparison_frame
| | Model | train recall | validation recall |
|---|---|---|---|
| 0 | Bagging original | 0.980 | 0.790 |
| 1 | Decision Tree original | 1.000 | 0.790 |
| 2 | Random Forest original | 1.000 | 0.740 |
| 3 | AdaBoost original | 0.840 | 0.820 |
| 4 | Gradient Boosting original | 0.870 | 0.820 |
| 5 | XGBoost original | 1.000 | 0.890 |
| 6 | Bagging OverSampling | 0.990 | 0.830 |
| 7 | Decision Tree OverSampling | 1.000 | 0.780 |
| 8 | Random Forest OverSampling | 1.000 | 0.820 |
| 9 | AdaBoost OverSampling | 0.850 | 0.870 |
| 10 | Gradient Boosting OverSampling | 0.910 | 0.900 |
| 11 | XGBoost OverSampling | 1.000 | 0.890 |
| 12 | Bagging UnderSampling | 0.990 | 0.900 |
| 13 | Decision Tree UnderSampling | 1.000 | 0.880 |
| 14 | Random Forest UnderSampling | 1.000 | 0.930 |
| 15 | AdaBoost UnderSampling | 0.950 | 0.940 |
| 16 | Gradient Boosting UnderSampling | 0.970 | 0.940 |
| 17 | XGBoost UnderSampling | 1.000 | 0.950 |
| 18 | AdaBoost hp tuning original data | 0.920 | 0.850 |
| 19 | GradientBoost hp tuning original data | 1.000 | 0.790 |
| 20 | XGBoost hp tuning original data | 0.930 | 0.860 |
| 21 | AdaBoost hp tuning oversampled data | 0.970 | 0.870 |
| 22 | GradientBoost hp tuning oversampled data | 1.000 | 0.780 |
| 23 | XGBoost hp tuning oversampled data | 0.980 | 0.900 |
| 24 | AdaBoost hp tuning undersampled data | 0.990 | 0.950 |
| 25 | GradientBoost hp tuning undersampled data | 1.000 | 0.870 |
| 26 | XGBoost hp tuning undersampled data | 0.990 | 0.960 |
The XGBoost model tuned on the undersampled data via randomized search achieved the highest validation recall (0.96) with a train recall of 0.99.
The second best is AdaBoost tuned on the undersampled data, with a validation recall of 0.95 and a train recall of 0.99.
The third best is XGBoost tuned on the oversampled data, with a validation recall of 0.90 and a train recall of 0.98.
Let's evaluate our best model, XGBoost tuned on the undersampled data, on the test dataset.
# Hyperparameter tuning: XGBoost on the undersampled data
# Define the parameter grid for hyperparameter tuning
param_grid={'n_estimators':np.arange(50,110,25),
'scale_pos_weight':[1,2,5],
'learning_rate':[0.01,0.1,0.05],
'gamma':[1,3],
'subsample':[0.7,0.9]
}
# Initialize the XGBoost model
xgb = XGBClassifier(random_state=1)
# Perform randomized search with 10 iterations
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(
estimator=xgb,
param_distributions=param_grid,
cv=5,
random_state=1,
n_iter=10
)
# Fit the randomized search on the undersampled training data
random_search.fit(X_train_under, y_train_under)
# Get the best model and best hyperparameters
best_model = random_search.best_estimator_
best_params = random_search.best_params_
# Print the best hyperparameters found during tuning
print("Best Hyperparameters:")
print(best_params)
print("\n" "Testing Performance XGBoost after hyperparameter tuning (undersampled data):" "\n")
scores = recall_score(y_test, best_model.predict(X_test))
print("{}: {}".format("XGBoost", scores))
model_performance_classification_sklearn("XGBoost", best_model, X_test, y_test)
make_confusion_matrix(y_train_under, best_model.predict(X_train_under), "Confusion Matrix for Train")
make_confusion_matrix(y_test, best_model.predict(X_test), "Confusion Matrix for Test")
Best Hyperparameters:
{'subsample': 0.7, 'scale_pos_weight': 1, 'n_estimators': 100, 'learning_rate': 0.05, 'gamma': 1}
Testing Performance XGBoost after hyperparameter tuning (undersampled data):
XGBoost: 0.9753846153846154
XGBoost
Accuracy Recall Precision F1
0 0.938 0.975 0.729 0.834
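`make_confusion_matrix` is another helper defined earlier in the notebook; a minimal sketch of such a helper, built on the `ConfusionMatrixDisplay` already imported above (the function name and return value here are assumptions):

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend; with %matplotlib inline this line can be omitted
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

def make_confusion_matrix_sketch(y_true, y_pred, title):
    # Compute the raw confusion matrix and render it with labelled axes
    cm = confusion_matrix(y_true, y_pred)
    ConfusionMatrixDisplay(confusion_matrix=cm).plot(cmap="Blues")
    plt.title(title)
    plt.show()
    return cm
```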
# Get the feature importances from the best model
importances = best_model.feature_importances_
# Create a DataFrame to store the feature importances
importance_df = pd.DataFrame({'Feature': X_test.columns.tolist(), 'Importance': importances})
# Sort the DataFrame by importance in descending order
importance_df = importance_df.sort_values(by='Importance', ascending=False)
# Create a bar plot of the feature importances
plt.figure(figsize=(10, 6))
plt.bar(importance_df['Feature'], importance_df['Importance'])
plt.xticks(rotation=90)
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.title('Feature Importances')
plt.tight_layout()
plt.show()
Customers who have been inactive for a month have a higher likelihood of attrition; the bank should focus on these customers and take steps to re-engage them.
A low transaction count, a low revolving balance, and small transaction amounts on a credit card are indicators that a customer is likely to attrite. Such customers are not actively using the card, so the bank should consider offering more rewards, cashback, or other incentives to encourage increased card usage.
Attrited customers tend to have a lower average utilization ratio, indicating that they are not using their credit card to its full potential.
Based on the EDA, customers aged 36-55, customers with a doctorate or postgraduate degree, and female customers tend to attrite more. This suggests that competing banks may be offering these customers better deals, leading them to use their credit cards less with the current bank.
Exploratory data analysis also reveals that customers who contacted the bank more often in the last 12 months are more likely to attrite. This highlights the need to investigate whether unresolved issues or concerns led to customer dissatisfaction and, ultimately, their departure from the bank.
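The recommendations above imply a simple operational use of the model: score every current customer and hand the highest-risk ones to the retention team first. A hypothetical sketch (the function name, the `top_n` parameter, and the stub model in the usage below are assumptions, not part of the notebook):

```python
import pandas as pd

def top_risk_customers(model, customers, top_n=100):
    # Rank customers by predicted attrition probability, highest risk first,
    # so re-engagement offers can target the riskiest segment
    churn_proba = model.predict_proba(customers)[:, 1]
    return customers.assign(churn_probability=churn_proba).nlargest(top_n, "churn_probability")
```

In practice `model` would be the tuned XGBoost classifier and `customers` a DataFrame with the same feature columns as the training data.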
import sklearn
print(sklearn.__version__)
1.2.2
import xgboost
print(xgboost.__version__)
2.0.3
from joblib import dump
# Save the best model to a file
model_filename = 'credit-card_users_churn_prediction_xgb_model.joblib'
dump(best_model, model_filename)
print(f"Model saved as {model_filename}")
Model saved as credit-card_users_churn_prediction_xgb_model.joblib
import pickle
# Save the best model to a file
model_filename = 'credit-card_users_churn_prediction_xgb_model.pkl'
with open(model_filename, 'wb') as file:
pickle.dump(best_model, file)
print(f"Model saved to {model_filename}")
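To confirm that a saved artifact is usable, it can be loaded back with joblib and asked for predictions; a round-trip sketch with a small stand-in estimator (the file name here is a demo assumption):

```python
from joblib import dump, load
from sklearn.tree import DecisionTreeClassifier

# Stand-in estimator: any fitted sklearn-compatible model, including the
# tuned XGBClassifier above, round-trips the same way
model = DecisionTreeClassifier(random_state=1).fit([[0], [1]], [0, 1])
dump(model, "model_roundtrip_demo.joblib")

# Later, e.g. in a scoring service: load and predict
restored = load("model_roundtrip_demo.joblib")
print(restored.predict([[0], [1]]))  # matches the original model's predictions
```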